

Introduction

The IREX 10: Identification Track assesses iris recognition performance for identification (a.k.a. one-to-many) applications. Most flagship deployments of iris recognition operate in identification mode, providing services ranging from prison management and border security to expedited processing and distribution of resources. The evaluation is administered at the Image Group’s Biometrics Research Lab (BRL), where developers submit their iris recognition software for testing over datasets sequestered at NIST. As an ongoing evaluation, it accepts submissions at any time.

Leaderboard

Two-eye Accuracy:

Accuracy Metric: FNIR (i.e., “miss rate”) at an FPIR of 0.01 (± 90% confidence)
Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: Both eyes
Enrolled Population: 500K people
Enrollment Method: Both (left and right) iris images per enrollment template

The number after the ± indicates either the 90% confidence interval (for accuracy) or the standard deviation (for times and sizes).
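To make the headline metric concrete, the sketch below shows one way FNIR at a fixed FPIR can be computed from search scores. The function name and the scores are illustrative inventions, not IREX code or data; it assumes similarity scores where higher means a stronger match.

```python
# Sketch of computing FNIR at a fixed FPIR from search scores.
# Names and scores are synthetic illustrations, not IREX data.
# Convention: higher score = stronger match (a similarity score).

def fnir_at_fpir(mated_scores, nonmated_scores, target_fpir=0.01):
    """Return FNIR ("miss rate") at the threshold where non-mated
    searches yield the target FPIR."""
    # Threshold: the score exceeded by target_fpir of non-mated searches.
    s = sorted(nonmated_scores, reverse=True)
    k = max(int(target_fpir * len(s)) - 1, 0)
    threshold = s[k]
    # FNIR: fraction of mated searches whose mate score fails the threshold.
    misses = sum(1 for m in mated_scores if m < threshold)
    return misses / len(mated_scores)

mated = [0.9, 0.8, 0.75, 0.3, 0.85, 0.95, 0.7, 0.2, 0.88, 0.92]
nonmated = [0.1, 0.2, 0.15, 0.05, 0.3, 0.25, 0.12, 0.18, 0.22, 0.08]
print(fnir_at_fpir(mated, nonmated, target_fpir=0.1))  # → 0.1
```

In practice the threshold would be estimated from many more non-mated searches than shown here, which is why the report attaches a confidence interval to the estimate.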


Single-eye Accuracy:

Accuracy Metric: FNIR (i.e., “miss rate”) at an FPIR of 0.01 (± 90% confidence)
Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: One eye
Enrolled Population: 1M irides (500K people)
Enrollment Method: One iris image per enrollment template

Thresholded Accuracy

Core accuracy for the identification task can be characterized by detection error trade-off (DET) plots. Generally, curves lower in a DET plot correspond to more accurate matchers. The plots are interactive through the use of the Plotly.js graphing library.

Two-eye Accuracy:

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: Both eyes
Enrolled Population: 500K people
Enrollment Method: Both (left and right) iris images per enrollment template

Single-eye Accuracy:

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: One eye
Enrolled Population: 1M irides (500K people)
Enrollment Method: One iris image per enrollment template
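A DET curve can be traced by sweeping a decision threshold over the mated and non-mated score distributions and recording the (FPIR, FNIR) pair at each threshold. The following is a minimal sketch with invented names and synthetic scores (higher score = more similar), not the evaluation's actual tooling:

```python
# Sketch of tracing a DET curve: sweep a decision threshold and record
# (FPIR, FNIR) pairs. Scores are synthetic; higher = more similar.

def det_points(mated, nonmated):
    thresholds = sorted(set(mated) | set(nonmated))
    points = []
    for t in thresholds:
        # FPIR: non-mated searches that (wrongly) clear the threshold.
        fpir = sum(1 for s in nonmated if s >= t) / len(nonmated)
        # FNIR: mated searches that (wrongly) fail the threshold.
        fnir = sum(1 for s in mated if s < t) / len(mated)
        points.append((fpir, fnir))
    return points

mated = [0.9, 0.8, 0.7, 0.6]
nonmated = [0.5, 0.4, 0.3, 0.2]
for fpir, fnir in det_points(mated, nonmated):
    print(fpir, fnir)
```

Plotting FNIR against FPIR over these points (typically on log axes) yields the DET curve; a matcher whose curve sits below another's is more accurate at every operating point.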

Rank Accuracy

Rank-based metrics are generally better at reflecting performance for investigational tasks, where the algorithm returns a list of candidates for an inspector to further scrutinize. The rank 10 “hit rate” is the fraction of searches that return the correct candidate within the top 10 candidates. The miss rate is one minus the hit rate.
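The rank-k hit rate described above reduces to a simple count over candidate lists. A minimal sketch (hypothetical function name, toy candidate lists):

```python
# Sketch of the rank-k hit rate: fraction of searches whose correct mate
# appears within the top k candidates. Candidate lists are hypothetical.

def rank_k_hit_rate(candidate_lists, true_ids, k=10):
    hits = sum(1 for cands, truth in zip(candidate_lists, true_ids)
               if truth in cands[:k])
    return hits / len(true_ids)

# Each search returns an ordered candidate list (best first).
searches = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]
truths   = ["B", "F", "Z"]          # "Z" is not enrolled: a guaranteed miss
hit = rank_k_hit_rate(searches, truths, k=2)
print("rank-2 hit rate:", hit)      # 1 of 3 truths found in the top 2
print("rank-2 miss rate:", 1 - hit)
```

Unlike the thresholded FNIR, this metric ignores scores entirely and depends only on where the true mate lands in the ordering.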

Two-eye Accuracy:

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: Both eyes
Enrolled Population: 500K people
Enrollment Method: Both (left and right) iris images per enrollment template

Single-eye Accuracy:

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: One eye
Enrolled Population: 1M irides (500K people)
Enrollment Method: One iris image per enrollment template

Computation Time

Computation times are measured as the elapsed real time (i.e., “wall clock” time) as opposed to CPU time. Timing estimates were computed on unloaded machines with only a single process dedicated to biometric operations. The test machines are Dell PowerEdge M910 blades with dual Intel Xeon X7560 2.3 GHz CPUs (eight cores per processor).
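The wall-clock vs. CPU-time distinction matters because a process that waits (on I/O, locks, or other processes) accrues real time without accruing CPU time. A small illustration of the difference, using Python's standard timers rather than anything from the evaluation harness:

```python
# Illustrating elapsed real ("wall clock") time vs. CPU time.
# A sleeping process accrues wall time but almost no CPU time.
import time

start_wall = time.perf_counter()   # wall-clock timer
start_cpu = time.process_time()    # CPU-time timer for this process
time.sleep(0.2)                    # stands in for I/O or waiting
wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")  # wall ≈ 0.2s, cpu ≈ 0s
```

Reporting wall-clock time on an unloaded machine, as done here, captures what an operator actually waits for, including any time the matcher spends blocked rather than computing.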



Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: Both eyes
Enrolled Population: 500K people
Enrollment Method: Both (left and right) iris images per enrollment template


Previous IREX evaluations identified a speed-accuracy trade-off whereby the more accurate matchers tend to take longer to return search results. The plot below shows FNIR as a function of median search time for each matcher. FNIR is computed at an FPIR of \(0.01\).

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: Both eyes
Enrolled Population: 500K people
Enrollment Method: Both (left and right) iris images per enrollment template

Quality Assessment

Some participants’ submissions output estimates of sample quality for each processed iris image. The ANSI/NIST-ITL 1-2011 standard requires these estimates to be in the range 0 to 100 and to quantitatively express the predicted matching performance of the sample. Error-reject rate curves show how FNIR can be reduced by discarding the poorest quality samples in the test data. In our case, the quality of a search was set to the minimum quality assigned to the searched image and its enrolled mate.
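The error-versus-reject computation can be sketched as follows: sort searches by quality, drop the lowest-quality fraction, and recompute the miss rate on what remains. The function name and the (quality, missed) records are synthetic illustrations, not IREX data:

```python
# Sketch of an error-versus-reject trade-off: discard the poorest-quality
# fraction of searches and recompute the miss rate on the remainder.
# Each search carries (quality, missed) — synthetic values for illustration.

def fnir_after_reject(searches, reject_fraction):
    kept = sorted(searches, key=lambda s: s[0])        # ascending quality
    n_reject = int(reject_fraction * len(kept))
    kept = kept[n_reject:]                             # drop lowest-quality
    return sum(missed for _, missed in kept) / len(kept)

# (quality in [0, 100], missed flag); poor quality correlates with misses
searches = [(5, 1), (10, 1), (30, 0), (50, 0), (60, 1),
            (70, 0), (80, 0), (90, 0), (95, 0), (99, 0)]
print(fnir_after_reject(searches, 0.0))   # → 0.3  (keep everything)
print(fnir_after_reject(searches, 0.2))   # → 0.125  (drop 2 lowest-quality)
```

Sweeping `reject_fraction` over a range of values and plotting FNIR against it produces the error-reject curve described above.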

The figure below demonstrates that FNIR (i.e. the ‘miss rate’) can be reduced by almost 20% by discarding just 1% of the poorest quality searches. Presumably, this 1% involved samples where the subject was blinking, moving, looking off-axis at the moment of capture, etc. The IREX III supplemental failure analysis found that matching failures for the most accurate matchers over a different dataset were almost entirely due to poor presentation of the iris.

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: One eye
Enrolled Population: 1M irides (500K people)
Enrollment Method: One iris image per enrollment template


The stacked barplot below shows how sample quality impacts the probability that a search will miss (i.e. fail to return the correct mate). Samples assigned low quality values should be more likely to miss. For Neurotechnology’s matcher, when the assigned value is 0 the probability of a miss is greater than 50%. FPIR is set to \(0.01\).

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: One eye
Enrolled Population: 1M irides (500K people)
Enrollment Method: One iris image per enrollment template
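The conditional miss probability shown in the barplot amounts to binning searches by their assigned quality value and computing the miss rate within each bin. A minimal sketch with invented names and synthetic data (not any submission's actual output):

```python
# Sketch of miss probability conditioned on sample quality: bin searches
# by assigned quality value and compute the miss rate within each bin.
# Records are synthetic (quality, missed) pairs, not matcher output.

def miss_rate_by_quality_bin(searches, bin_width=25):
    bins = {}
    for quality, missed in searches:
        b = (quality // bin_width) * bin_width   # bin's lower edge
        total, misses = bins.get(b, (0, 0))
        bins[b] = (total + 1, misses + missed)
    return {b: misses / total
            for b, (total, misses) in sorted(bins.items())}

searches = [(5, 1), (15, 1), (20, 0), (30, 1), (45, 0),
            (60, 0), (70, 0), (80, 0), (90, 0), (99, 0)]
print(miss_rate_by_quality_bin(searches))
```

If the quality values are predictive, the miss rate should decrease monotonically from the lowest bin to the highest, as it does for this toy data.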

The sample qualities of left and right iris images acquired during the same session are expected to be highly correlated. In addition to having similar capture environments, dual-eye cameras acquire both images at nearly the same instant, so poor presentation of the irides at the moment of capture (e.g. blinking or moving) detrimentally affects both images. For this reason, matching both acquired images rather than just one yields only a moderate improvement in accuracy. The figure below shows the distribution of qualities, with each axis representing the quality of one of the iris images (left or right) acquired during the same capture session.

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: One eye
Enrolled Population: 1M irides (500K people)
Enrollment Method: One iris image per enrollment template
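The left/right correlation discussed above can be quantified with a correlation coefficient over per-session quality pairs. A minimal sketch using Pearson's r with synthetic quality values (not OPS4 data):

```python
# Sketch of measuring left/right quality correlation per capture session
# using Pearson's r. Quality pairs are synthetic, not OPS4 values.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# (left-eye quality, right-eye quality) for each session
left  = [90, 85, 20, 70, 95, 15, 60, 80]
right = [88, 80, 25, 65, 97, 10, 55, 85]
print(f"r = {pearson_r(left, right):.3f}")   # close to 1: highly correlated
```

A value of r near 1 is exactly the situation described above: when one eye's sample is poor, the other's usually is too, which limits the accuracy gain from enrolling both.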


The acquisition protocol for OPS4 images has probably improved over time. Better iris cameras and capture environments are likely to have improved the quality of the acquired images. Iris recognition accuracy is highly dependent on the prevalence of very poor quality samples. Misses tend to occur when the subject was blinking, moving, or looking off-axis at the instant of capture. The figure below shows the prevalence of these very low quality samples in OPS4 for each capture year. Comparatively few images in OPS4 were collected prior to 2014, so results for these images are omitted. An iris sample was deemed to have very low quality if its quality value was among the lowest 2% (i.e. below the 2% quantile) of all images in OPS4.

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Samples used: One eye
Enrolled Population: 1M irides (500K people)
Enrollment Method: One iris image per enrollment template
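The per-year prevalence computation described above (cutoff at the dataset-wide 2% quality quantile, then a per-year tally) can be sketched as follows; the function name and the (year, quality) records are invented for illustration:

```python
# Sketch of tallying the prevalence of very-low-quality samples per
# capture year. The cutoff is the 2% quantile of quality over the whole
# dataset. Sample records are synthetic (year, quality) pairs.

def low_quality_fraction_by_year(samples, quantile=0.02):
    qualities = sorted(q for _, q in samples)
    cutoff = qualities[int(quantile * len(qualities))]   # 2% quantile
    by_year = {}
    for year, q in samples:
        total, low = by_year.get(year, (0, 0))
        by_year[year] = (total + 1, low + (q < cutoff))
    return {year: low / total
            for year, (total, low) in sorted(by_year.items())}

# Synthetic dataset: the later year has uniformly higher quality.
samples = ([(2014, q) for q in range(50)] +
           [(2015, q + 25) for q in range(50)])
print(low_quality_fraction_by_year(samples))  # → {2014: 0.04, 2015: 0.0}
```

Because the cutoff is global, a declining per-year fraction directly reflects improving capture quality over time.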

Algorithm Fusion

Combining the results from multiple submissions sometimes yields improved accuracy over individual submissions. In this section, score-level fusion is used to combine search results from multiple submissions. Equal-weighted Neyman-Pearson fusion is used to merge candidate lists from different submissions into a single consolidated candidate list. The dissimilarity score associated with each candidate is normalized prior to fusion (see LFAR score). This normalized score is a measure of similarity rather than dissimilarity. Any candidate appearing on multiple lists is assigned a single fused score by summing the individual LFAR scores. The merged candidate list is then reordered by the LFAR scores.

Only fusion results that yield an improvement in accuracy over the individual submissions are shown.
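Structurally, the fusion step above is: normalize each matcher's dissimilarity scores into similarities, sum the scores of candidates that appear on multiple lists, and re-sort. The sketch below uses a toy min-max normalization as a stand-in for the LFAR normalization (whose exact form is not reproduced here); all names and scores are illustrative:

```python
# Sketch of equal-weighted score-level fusion of candidate lists. A toy
# min-max normalization stands in for the LFAR normalization described
# above. Candidates on multiple lists get the sum of their normalized
# scores, and the merged list is re-sorted.

def normalize(candidates):
    """Toy normalization: map dissimilarity to similarity in [0, 1]."""
    lo, hi = min(candidates.values()), max(candidates.values())
    return {c: (hi - d) / (hi - lo) for c, d in candidates.items()}

def fuse(*candidate_lists):
    fused = {}
    for candidates in candidate_lists:
        for c, s in normalize(candidates).items():
            fused[c] = fused.get(c, 0.0) + s
    # Highest fused similarity first
    return sorted(fused.items(), key=lambda cs: -cs[1])

# candidate -> dissimilarity score, one dict per matcher
matcher_a = {"id1": 0.10, "id2": 0.40, "id3": 0.90}
matcher_b = {"id1": 0.20, "id4": 0.30, "id2": 0.80}
print(fuse(matcher_a, matcher_b))
```

A candidate ranked highly by both matchers (here "id1") accumulates score from each list and rises to the top of the consolidated list, which is the mechanism by which fusion can outperform either matcher alone.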

Enrollment Size

Accuracy is impacted by the size of the enrollment database (a.k.a the gallery size). Identification of the correct mate is expected to be more difficult for larger enrollment database sizes. The figure below plots FNIR (at FPIR=\(0.01\)) as a function of enrollment database size.

Dataset: Operational Dataset 4th pull (stats on OPS4 images)
Accuracy Metric: FNIR (i.e., “miss rate”) at an FPIR of 0.01
Samples used: Both eyes
Enrollment Method: One enrollment session per person

Some apparent trends may be the result of random variation. Results for the 10K and 50K enrollment sizes were computed from 140K searches. Results for the 100K and 500K enrollment sizes were computed from 700K searches.

How to Participate

Participation is open to any commercial or academic organization free of charge. The first step is to email the signed Participation Agreement to NIST. Instructions on building a submission can be found in the API and concept of operations (CONOPS) document. The CONOPS document is supplemented by the frvt1N.h and frvt_structs.h header files. To assist with development, a minimal working “stub” (a.k.a. null implementation) is also available. See also our FAQ.

Participants are allowed to submit an implementation once every 3 calendar months.

Please send comments and recommendations to irex@nist.gov.

Contact Info

Inquiries and comments may be submitted to irex@nist.gov. Subscribe to the IREX mailing list to stay up-to-date on all IREX-related activities.